Skip to content

add structured error types and retry logic for 429/5xx#7

Open
ayaan-kapoor wants to merge 2 commits into
gumloop:mainfrom
ayaan-kapoor:feat/retry-and-error-hierarchy
Open

add structured error types and retry logic for 429/5xx#7
ayaan-kapoor wants to merge 2 commits into
gumloop:mainfrom
ayaan-kapoor:feat/retry-and-error-hierarchy

Conversation

@ayaan-kapoor

Copy link
Copy Markdown

What

  • Add specific APIStatusError subclasses dispatched by status code:
    BadRequestError (400), PermissionDeniedError (403), NotFoundError (404),
    UnprocessableEntityError (422), RateLimitError (429), ServerError (5xx)
  • Add retry logic with exponential backoff to HttpClient and AsyncHttpClient
  • Surface max_retries on Gumloop and AsyncGumloop constructors (default: 2)

Why

Right now everything non-2xx raises a generic APIStatusError, so callers
have to inspect .status_code manually to know what went wrong. Specific
subclasses make error handling a lot cleaner — you can just catch
RateLimitError or NotFoundError directly.

The retry logic handles transient 429s and 5xx failures automatically.
Only server-side errors are retried, never 4xx. Respects Retry-After
when present, otherwise falls back to exponential backoff with jitter.
max_retries=0 disables it entirely if needed.

Tests

18 new tests in tests/sdk/test_errors.py.

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds a typed error hierarchy (BadRequestError, PermissionDeniedError, NotFoundError, UnprocessableEntityError, RateLimitError, ServerError) dispatched by HTTP status code, and exponential-backoff retry logic with Retry-After support to both sync and async HTTP clients.

  • Error dispatch: to_api_error now returns the most-specific subclass via a _STATUS_MAP lookup, falling through to ServerError for any 5xx and to the base APIStatusError for unmapped 4xx codes.
  • Retry logic: _should_retry correctly blocks 5xx retries on non-idempotent methods (POST, PATCH) while allowing 429 retries on all methods; _parse_retry_after handles both integer-seconds and RFC 7231 HTTP-date formats.
  • Public surface: max_retries (default 2) is surfaced on both Gumloop and AsyncGumloop constructors and forwarded to the underlying transport clients.

Confidence Score: 5/5

Safe to merge — the retry and error-dispatch logic is correct, idempotency guards work as described, and Retry-After HTTP-date parsing is properly implemented.

The core changes are well-structured: the idempotency guard correctly prevents POST/PATCH 5xx retries, _parse_retry_after handles both header formats, and the error class hierarchy is cleanly dispatched. The only findings are a misleading backoff comment and an async test that skips patching sleep — neither affects runtime correctness.

No files require special attention.

Important Files Changed

Filename Overview
src/gumloop/errors.py Adds six typed error subclasses and a _STATUS_MAP dispatch table. to_api_error correctly falls through to ServerError for any 5xx not in the map, and to the base APIStatusError for unknown 4xx codes.
src/gumloop/_http.py Introduces retry helpers (_should_retry, _retry_delay, _parse_retry_after) and wraps both _request and post_to_stream_host in a retry loop. POST/PATCH 5xx are correctly blocked from retrying; Retry-After HTTP-date format is now handled. A misleading module-level comment describes the backoff formula incorrectly.
src/gumloop/_client.py Threads max_retries through both Gumloop and AsyncGumloop constructors and forwards it to HttpClient / AsyncHttpClient. Straightforward plumbing with no issues.
src/gumloop/init.py Re-exports the five new error classes from the public API surface. __all__ is correctly updated.
tests/sdk/test_errors.py 18 new tests covering error dispatch, sync/async retry, idempotency guards, and Retry-After parsing. test_async_retries_on_500_then_succeeds does not patch asyncio.sleep, so it will actually sleep for up to 0.5 s non-deterministically.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[HTTP request via _request / post_to_stream_host] --> B{status < 400?}
    B -- yes --> C[Return response]
    B -- no --> D[to_api_error typed subclass]
    D --> E{_should_retry?}
    E -- no: 4xx non-429, or 5xx on POST/PATCH --> F[raise exc]
    E -- yes: 429 or 5xx on idempotent method --> G{attempt < max_retries?}
    G -- no --> F
    G -- yes --> H[_parse_retry_after]
    H --> I{Retry-After header?}
    I -- seconds string --> J[use parsed float]
    I -- HTTP-date string --> K[compute delta from now]
    I -- absent --> L[exponential backoff with full jitter]
    J & K & L --> M[sleep / asyncio.sleep]
    M --> A
Loading

Reviews (2): Last reviewed commit: "address review: idempotent-only 5xx retr..." | Re-trigger Greptile

Comment thread src/gumloop/_http.py
Comment on lines +240 to +250
for attempt in range(self._max_retries + 1):
response = self._client.request(method, path, headers=headers, **kwargs)
if response.status_code < 400:
return response.json() if response.content else None
exc = to_api_error(response)
if attempt < self._max_retries and _should_retry(exc):
delay = _retry_delay(attempt, _parse_retry_after(response))
logger.debug("retrying %s %s (attempt %d, delay %.2fs)", method, path, attempt + 1, delay)
time.sleep(delay)
continue
raise exc

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Non-idempotent POST/PATCH retried on 5xx

_request is used by post() and patch() as well as get() and delete(), so the retry loop applies to all of them equally. A 5xx can be returned by a proxy or load balancer after the upstream already committed the write — retrying will then execute the mutation a second time. For example, client.agents.create(...) hitting a 502 from a gateway would create the agent twice before the caller sees any error. The retry should be limited to idempotent methods (GET, DELETE, HEAD) or guarded by a method check.

Comment thread src/gumloop/_http.py
Comment on lines +72 to +79
def _parse_retry_after(response: httpx.Response) -> float | None:
raw = response.headers.get("retry-after")
if raw is None:
return None
try:
return float(raw)
except ValueError:
return None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Retry-After HTTP-date format is silently dropped

RFC 7231 allows Retry-After to be either a delay-in-seconds (Retry-After: 30) or an HTTP-date (Retry-After: Wed, 21 Oct 2015 07:28:00 GMT). When the header contains a date string, float(raw) raises ValueError, the function returns None, and the code falls back to exponential backoff — which for attempt=0 is at most 0.5 s. If the server sent a date meaning "wait 30 s", the client will retry far too soon, burn through all retries, and still surface a RateLimitError to the caller while likely worsening the rate-limit situation on the server side.

Comment thread src/gumloop/_http.py
Comment on lines 175 to 233
@@ -184,7 +227,7 @@ def stream_typed(
yield response_model.model_validate_json(event.data)
except ValidationError:
# Server-side mid-stream error frames or schema-drift events
# land here.
# land here.
logger.debug("dropped non-%s SSE: %s", response_model.__name__, event.data)
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 stream and stream_typed have no retry logic

Both HttpClient.stream / HttpClient.stream_typed (and their async counterparts) still use the original single-attempt pattern. A 429 or 5xx on a streaming request will raise immediately, while the same status on a non-streaming request through _request or post_to_stream_host will be retried up to max_retries times. Callers who expect consistent retry behaviour regardless of whether they use the streaming or non-streaming variant will be surprised by this inconsistency.

Comment thread tests/sdk/test_errors.py
Comment on lines +78 to +90
@respx.mock
def test_retries_on_500_then_succeeds(client: Gumloop) -> None:
route = respx.get(f"{API_BASE}/agents").mock(
side_effect=[
httpx.Response(500, json={"error": {"message": "internal error"}}),
httpx.Response(200, json={"agents": []}),
]
)

result = client.agents.list()

assert result.agents == []
assert route.call_count == 2

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Test performs a real time.sleep during retry

test_retries_on_500_then_succeeds does not patch time.sleep, so the first retry will actually block for random.uniform(0, 0.5) seconds. While jitter can make this 0, it is non-deterministic and will occasionally slow down the test suite. All other retry tests that care about timing explicitly patch time.sleep — this one should too for consistency.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant